Introduction

This notebook explores using CopyKAT to estimate tumor and normal cells in SCPCS000490 from SCPCP000015.

CopyKAT was run using the run-copykat.R script and specifying endothelial cells from both SingleR and CellAssign as the normal reference. This produced results from CopyKAT both with and without a normal reference. These results are read into this notebook and used to:

Setup

suppressPackageStartupMessages({
  # load required packages
  library(SingleCellExperiment)
  library(ggplot2)
  library(copykat)
})
# The base path for the OpenScPCA repository, found by its (hidden) .git directory
repository_base <- rprojroot::find_root(rprojroot::is_git_root)

# The current data directory, found within the repository base directory
data_dir <- file.path(repository_base, "data", "current")
sample_dir <- file.path(data_dir, "SCPCP000015", params$sample_id)

# The path to this module
module_base <- file.path(repository_base, "analyses", "ewings-cell-type")
# source in helper functions for make_jaccard_matrix() and jaccard()
jaccard_functions <- file.path(module_base, "scripts", "utils", "jaccard-functions.R")
source(jaccard_functions)
# Input files
sce_filename <- glue::glue("{params$library_id}_processed.rds")
sce_file <- file.path(sample_dir, sce_filename)

# tumor/normal classifications 
classifications_filename <- glue::glue("{params$library_id}_tumor_normal_classifications.tsv")
classifications_file <- file.path(module_base, "results", "marker_gene_analysis", classifications_filename)

# copykat results 
results_dir <- file.path(module_base, "results", "copykat", params$library_id)

# predictions files 
predictions_file <- glue::glue("{params$library_id}_copykat_prediction.txt")

predictions_paths <- c(
  no_ref = file.path(results_dir, "no_normal", predictions_file),
  with_ref = file.path(results_dir, "endothelial_normal", predictions_file)
)

# full results gene by cell 
full_ck_result_file <- glue::glue("{params$library_id}_copykat_CNA_raw_results_gene_by_cell.txt")
  
full_ck_result_paths <- c(
  no_ref = file.path(results_dir, "no_normal", full_ck_result_file),
  with_ref = file.path(results_dir, "endothelial_normal", full_ck_result_file)
)
# read in processed sce
sce <- readr::read_rds(sce_file)

# read in tumor normal classifications 
manual_classifications_df <- readr::read_tsv(classifications_file)
## Rows: 4290 Columns: 3
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): barcodes, marker_gene_classification, cluster_classification
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read in ck predictions from both reference types (no_normal and with_normal)
ck_results_df <- predictions_paths |> 
  purrr::map(readr::read_tsv) |> 
  dplyr::bind_rows(.id = "reference_used")
## Rows: 4290 Columns: 2
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): cell.names, copykat.pred
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 4290 Columns: 2
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): cell.names, copykat.pred
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read in full gene by cell copy number detection results 
full_ck_results_df <- full_ck_result_paths |> 
  purrr::map(readr::read_tsv) |> 
  dplyr::bind_rows(.id = "reference_used")
## Rows: 6860 Columns: 4210
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr    (3): ensembl_gene_id, hgnc_symbol, band
## dbl (4207): abspos, chromosome_name, start_position, end_position, AAGCATCTCGTTGCCT, AGGTCTAAGGGACTGT, CTGTATTTCCAAGAGG, AATTTCCTCT...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 6860 Columns: 4210
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr    (3): ensembl_gene_id, hgnc_symbol, band
## dbl (4207): abspos, chromosome_name, start_position, end_position, AAGCATCTCGTTGCCT, AGGTCTAAGGGACTGT, CTGTATTTCCAAGAGG, AATTTCCTCT...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

CopyKAT results

Below we look at the heatmaps produced by CopyKAT.

Heatmap without reference

Heatmap with endothelial cells as reference

Both of these heatmaps appear to show some increase in observed CNVs in aneuploid vs. diploid which is to be expected.

UMAP

Below we prepare and plot a UMAP that shows which cells are classified as diploid, aneuploid, and not defined by CopyKAT. We show a side by side UMAP with results from running CopyKAT both with and without a reference of normal cells.

umap_df <- sce |> 
  scuttle::makePerCellDF(use.dimred = "UMAP") |> 
  # replace UMAP.1 with UMAP1
  dplyr::rename_with(
        \(x) stringr::str_replace(x, "^UMAP\\.", "UMAP")
      )

cnv_df <- umap_df |> 
  # first add manual annotations
  dplyr::left_join(manual_classifications_df) |> 
  # now add copykat results
  dplyr::left_join(ck_results_df, by = c("barcodes" = "cell.names"))
## Joining with `by = join_by(barcodes)`
ggplot(cnv_df, aes(x = UMAP1, y = UMAP2, color = copykat.pred)) +
  geom_point(alpha = 0.5, size = 0.5) +
  theme_bw() +
  facet_wrap(vars(reference_used))

If we look at the UMAPs shown in 01-marker-gene-tumor-classification.Rmd, the cells that are labeled aneuploid here are consistent with where we saw marker gene expession for tumor cells. It also looks like there is not a visible difference between using a reference of normal cells and not using a reference.

Validate common CNAs found in Ewing sarcoma

To validate some of these annotations, we can also look at some commonly found copy number variations found in Ewing sarcoma patients. There are a few known copy number variations in Ewing’s sarcoma:

  • Gain of Chr8
  • Gain of Chr12
  • Gain of Chr1p
  • Loss of Chr16q

Although these are the most frequent, there are patients who do not have any of these alterations and patients that only have some of these alterations. See Tirode et al., and Crompton et al..

CopyKAT outputs a matrix that contains the estimated copy numbers for each gene in each cell. We can read that in and look at the mean estimated copy numbers for each chromosome across each cell. We might expect that tumor cells would show an increased estimated copy number in Chr8, Chr12, and/or Chf 1 and a loss of Chr16.

# for every cell, calculate the mean detection level across all genes in a given chromosome
full_cnv_df <- full_ck_results_df |> 
  tidyr::pivot_longer(cols = -c(
    reference_used,
    abspos, 
    chromosome_name, 
    start_position, 
    end_position, 
    ensembl_gene_id, 
    hgnc_symbol, 
    band),
    names_to = "barcodes",
    values_to = "cnv_detection") |> 
  dplyr::group_by(chromosome_name, barcodes, reference_used) |> 
  dplyr::summarise(mean_cnv_detection = mean(cnv_detection)) 
## `summarise()` has grouped output by 'chromosome_name', 'barcodes'. You can override using the `.groups` argument.
# filter to only have cnvs found in Ewings to keep things a bit smaller when joining
ewings_chrs <- c(1, 8, 12, 16)

ewings_cnv_df <- full_cnv_df |> 
  dplyr::filter(chromosome_name %in% ewings_chrs)

# join with 
cnv_df <- cnv_df |> 
  dplyr::left_join(ewings_cnv_df, by = c("barcodes", "reference_used"))
# create faceted UMAPs showing estimation of CNV detection across each chr of interest 
ewings_chrs |> 
  purrr::map(\(chr){
    plot_df <- cnv_df |> 
      dplyr::filter(chromosome_name == chr)
    
    ggplot(plot_df, aes(x = UMAP1, y = UMAP2, color = mean_cnv_detection)) +
      geom_point(alpha = 0.5, size = 0.5) +
      theme_bw() +
      scale_color_viridis_c() +
      facet_wrap(vars(reference_used)) + 
      labs(title = glue::glue("chr{chr}"))
  })
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

Looking at the UMAPs, I don’t see any really obvious differences between CNV estimation and groups of cells. Let’s look at the distribution of CNV estimation in cells that are called aneuploid and diploid by CopyKAT.

# create faceted density plots showing estimation of CNV detection across each chr of interest 
# colored by aneuploid/diploid estimation 
ewings_chrs |> 
  purrr::map(\(chr){
    plot_df <- cnv_df |> 
      dplyr::filter(chromosome_name == chr)
    
    ggplot(plot_df, aes(x = mean_cnv_detection, color = copykat.pred)) +
      geom_density() +
      theme_bw() +
      facet_wrap(vars(reference_used)) + 
      labs(title = glue::glue("chr{chr}"))
  })
## [[1]]

## 
## [[2]]

## 
## [[3]]

## 
## [[4]]

It looks like there is a slight increase in chr8 and chr12 indicating maybe a copy number gain for those chromosomes in aneuploid cells, but nothing striking.

Confusion Matrix

Below we directly compare the annotations obtained using manual classification of tumor and normal cells to annotating cells with CopyKAT. To do this, we will calculate the confusion matrix using caret::confusionMatrix().

filtered_cnv_df <- cnv_df |> 
  dplyr::filter(copykat.pred != "not.defined")
  
caret_df_list <- filtered_cnv_df |> 
  dplyr::mutate(copykat = ifelse(
    # use str_detect for test data
    copykat.pred == "diploid", "Normal", "Tumor"
  )) |> 
  # make tumor the positive class
  dplyr::mutate(
    copykat = forcats::fct_relevel(copykat, "Tumor"),
    marker_gene_classification = forcats::fct_relevel(marker_gene_classification, "Tumor")
  ) |> 
  split(cnv_df$reference_used)
## Warning in split.default(x = seq_len(nrow(x)), f = f, drop = drop, ...): data length is not a multiple of split variable
caret_df_list |> 
  purrr::imap(\(df, ref_type){
    caret::confusionMatrix(
      table(
        df$marker_gene_classification, 
        df$copykat)
    )
  })
## $no_ref
## Confusion Matrix and Statistics
## 
##         
##          Tumor Normal
##   Tumor   8832    403
##   Normal  1456   6122
##                                           
##                Accuracy : 0.8894          
##                  95% CI : (0.8846, 0.8941)
##     No Information Rate : 0.6119          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7739          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8585          
##             Specificity : 0.9382          
##          Pos Pred Value : 0.9564          
##          Neg Pred Value : 0.8079          
##              Prevalence : 0.6119          
##          Detection Rate : 0.5253          
##    Detection Prevalence : 0.5493          
##       Balanced Accuracy : 0.8984          
##                                           
##        'Positive' Class : Tumor           
##                                           
## 
## $with_ref
## Confusion Matrix and Statistics
## 
##         
##          Tumor Normal
##   Tumor   8968    277
##   Normal  1536   6030
##                                           
##                Accuracy : 0.8922          
##                  95% CI : (0.8874, 0.8968)
##     No Information Rate : 0.6248          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7788          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8538          
##             Specificity : 0.9561          
##          Pos Pred Value : 0.9700          
##          Neg Pred Value : 0.7970          
##              Prevalence : 0.6248          
##          Detection Rate : 0.5335          
##    Detection Prevalence : 0.5499          
##       Balanced Accuracy : 0.9049          
##                                           
##        'Positive' Class : Tumor           
## 

Looking at these results, we see that cells classified as tumor cells using marker genes are mostly overlapping with cells that are aneuploid in CopyKAT, which is what we expect! We also see that the Kappa values are close to 1, indicating almost perfect agreement.

It also looks like there isn’t a difference between using the reference and not using the reference. The reference Kappa is .1 higher, but I don’t think that’s enough to argue that you need to use a reference. The one thing to note is that using a reference cut the run time for CopyKAT in half, so that is one benefit.

We can also calculate the Jaccard similarity index to visualize the amount of cells that have overlapping annotations. Given the confusion matrix, I would expect to see good agreement between marker gene and CopyKAT annotations.

# calculate Jaccard similarity index for each reference type
jaccard_matrices <- caret_df_list |>
  purrr::map(\(df) {
    
    make_jaccard_matrix(
      df,
      "marker_gene_classification",
      "copykat.pred"
    )
  })
# Set heatmap padding option
heatmap_padding <- 0.2
ComplexHeatmap::ht_opt(TITLE_PADDING = grid::unit(heatmap_padding, "in"))

# heatmaps comparing tumor/normal annotations manually vs. copyKAT
heatmap <- jaccard_matrices |>
  purrr::imap(
    \(jaccard_mtx, ref_type) {
      ComplexHeatmap::Heatmap(
        t(jaccard_mtx), # transpose because matrix rows are in common & we want a vertical arrangement
        col = circlize::colorRamp2(c(0, 1), colors = c("white", "darkslateblue")),
        border = TRUE,
        ## Row parameters
        cluster_rows = TRUE,
        row_title = ref_type,
        row_title_gp = grid::gpar(fontsize = 12),
        row_title_side = "left",
        row_names_side = "left",
        row_dend_side = "right",
        row_names_gp = grid::gpar(fontsize = 10),
        ## Column parameters
        cluster_columns = FALSE,
        column_title = "",
        column_title_gp = grid::gpar(fontsize = 12),
        column_names_side = "bottom",
        column_names_gp = grid::gpar(fontsize = 10),
        column_names_rot = 90,
        ## Legend parameters
        heatmap_legend_param = list(
          title = "Jaccard index",
          direction = "vertical",
          legend_width = unit(1.5, "in")
        ),
        show_heatmap_legend = ref_type == "no_ref",
      )
    }) |>
  # concatenate vertically into HeatmapList object
  purrr::reduce(ComplexHeatmap::`%v%`) |>
  ComplexHeatmap::draw(
    heatmap_legend_side = "right",
    # add a margin to the heatmap so labels don't get cut off
    padding = unit(c(2, 20, 2, 2), "mm")
  )

As expected, it looks like cells classified as tumor cells using marker gene annotation are classified as aneuploid with CopyKAT.

Compare CopyKAT to SingleR and CellAssign

Lastly, we will compare the annotations from CopyKAT to those obtained using SingleR and CellAssign by calculating the Jaccard similarity index. For this comparison we will use just the annotations from CopyKAT with no reference.

celltype_columns <- c(
  "singler_celltype_annotation",
  "cellassign_celltype_annotation"
)

# filter to only get annotations from no ref
no_ref_only <- cnv_df |> 
  dplyr::filter(reference_used == "no_ref")

# create jaccard matrices for SingleR and CellAssign compared to aneuploid/diploid 
jaccard_matrices <- celltype_columns |>
  purrr::map(\(name) {
    make_jaccard_matrix(
      no_ref_only,
      "copykat.pred",
      name
    )
  }) |> 
  purrr::set_names("SingleR", "CellAssign")
# Set heatmap padding option
heatmap_padding <- 0.2
ComplexHeatmap::ht_opt(TITLE_PADDING = grid::unit(heatmap_padding, "in"))

# list of heatmaps looking at SingleR/ CellAssign vs tumor/normal 
heatmap <- jaccard_matrices |>
  purrr::imap(
    \(celltype_mat, celltype_method) {
      ComplexHeatmap::Heatmap(
        t(celltype_mat), # transpose because matrix rows are in common & we want a vertical arrangement
        col = circlize::colorRamp2(c(0, 1), colors = c("white", "darkslateblue")),
        border = TRUE,
        ## Row parameters
        cluster_rows = TRUE,
        row_title = celltype_method,
        row_title_gp = grid::gpar(fontsize = 12),
        row_title_side = "left",
        row_names_side = "left",
        row_dend_side = "right",
        row_names_gp = grid::gpar(fontsize = 10),
        ## Column parameters
        cluster_columns = FALSE,
        column_title = "",
        column_title_gp = grid::gpar(fontsize = 12),
        column_names_side = "bottom",
        column_names_gp = grid::gpar(fontsize = 10),
        column_names_rot = 90,
        ## Legend parameters
        heatmap_legend_param = list(
          title = "Jaccard index",
          direction = "vertical",
          legend_width = unit(1.5, "in")
        ),
        show_heatmap_legend = celltype_method == "SingleR",
      )
    }) |>
  # concatenate vertically into HeatmapList object
  purrr::reduce(ComplexHeatmap::`%v%`) |>
  ComplexHeatmap::draw(
    heatmap_legend_side = "right",
    # add a margin to the heatmap so labels don't get cut off
    padding = unit(c(2, 20, 2, 2), "mm")
  )

We see that aneuploid cells are mostly muscle cells in both CellAssign and SingleR. In SingleR there are some other cell types that were assigned that have a few anueploid cells (chondrocytes, neurons, and fibroblasts).

For diploid cells, we see the most overlap with fibroblasts, endothelial cells, and macrophages. These are consistent with the cell types that we see normal cells line up with in 01-marker-gene-tumor-classification.Rmd.

Conclusions

Session Info

# record the versions of the packages used in this analysis and other environment information
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: x86_64-apple-darwin20
## Running under: macOS Ventura 13.6.6
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Chicago
## tzcode source: internal
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices datasets  utils     methods   base     
## 
## other attached packages:
##  [1] copykat_1.1.0               ggplot2_3.5.1               SingleCellExperiment_1.26.0 SummarizedExperiment_1.34.0
##  [5] Biobase_2.64.0              GenomicRanges_1.56.0        GenomeInfoDb_1.40.0         IRanges_2.38.0             
##  [9] S4Vectors_0.42.0            BiocGenerics_0.50.0         MatrixGenerics_1.16.0       matrixStats_1.3.0          
## 
## loaded via a namespace (and not attached):
##   [1] RColorBrewer_1.1-3        rstudioapi_0.16.0         jsonlite_1.8.8            shape_1.4.6.1             magrittr_2.0.3           
##   [6] farver_2.1.1              rmarkdown_2.26            GlobalOptions_0.1.2       fs_1.6.4                  zlibbioc_1.50.0          
##  [11] vctrs_0.6.5               DelayedMatrixStats_1.26.0 htmltools_0.5.8.1         S4Arrays_1.4.0            forcats_1.0.0            
##  [16] SparseArray_1.4.3         pROC_1.18.5               caret_6.0-94              sass_0.4.9                parallelly_1.37.1        
##  [21] bslib_0.7.0               plyr_1.8.9                cachem_1.0.8              lubridate_1.9.3           lifecycle_1.0.4          
##  [26] iterators_1.0.14          pkgconfig_2.0.3           Matrix_1.7-0              R6_2.5.1                  fastmap_1.1.1            
##  [31] GenomeInfoDbData_1.2.12   future_1.33.2             clue_0.3-65               digest_0.6.35             colorspace_2.1-0         
##  [36] rprojroot_2.0.4           beachmat_2.20.0           labeling_0.4.3            fansi_1.0.6               timechange_0.3.0         
##  [41] httr_1.4.7                abind_1.4-5               compiler_4.4.0            proxy_0.4-27              bit64_4.0.5              
##  [46] withr_3.0.0               doParallel_1.0.17         BiocParallel_1.38.0       highr_0.10                MASS_7.3-60.2            
##  [51] lava_1.8.0                DelayedArray_0.30.1       rjson_0.2.21              ModelMetrics_1.2.2.2      tools_4.4.0              
##  [56] future.apply_1.11.2       nnet_7.3-19               glue_1.7.0                nlme_3.1-164              grid_4.4.0               
##  [61] cluster_2.1.6             reshape2_1.4.4            generics_0.1.3            recipes_1.0.10            gtable_0.3.5             
##  [66] tzdb_0.4.0                class_7.3-22              tidyr_1.3.1               data.table_1.15.4         hms_1.1.3                
##  [71] utf8_1.2.4                XVector_0.44.0            foreach_1.5.2             pillar_1.9.0              stringr_1.5.1            
##  [76] vroom_1.6.5               circlize_0.4.16           splines_4.4.0             dplyr_1.1.4               lattice_0.22-6           
##  [81] renv_1.0.7                survival_3.5-8            bit_4.0.5                 tidyselect_1.2.1          ComplexHeatmap_2.20.0    
##  [86] scuttle_1.14.0            knitr_1.46                xfun_0.43                 hardhat_1.3.1             timeDate_4032.109        
##  [91] pheatmap_1.0.12           stringi_1.8.4             UCSC.utils_1.0.0          yaml_2.3.8                evaluate_0.23            
##  [96] codetools_0.2-20          tibble_3.2.1              BiocManager_1.30.23       cli_3.6.2                 rpart_4.1.23             
## [101] munsell_0.5.1             jquerylib_0.1.4           Rcpp_1.0.12               globals_0.16.3            png_0.1-8                
## [106] parallel_4.4.0            gower_1.0.1               readr_2.1.5               sparseMatrixStats_1.16.0  listenv_0.9.1            
## [111] viridisLite_0.4.2         ipred_0.9-14              scales_1.3.0              prodlim_2023.08.28        e1071_1.7-14             
## [116] purrr_1.0.2               crayon_1.5.2              GetoptLong_1.0.5          rlang_1.1.3